-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TIFF: Add ASCII validation for ASCII fields #487
base: integration
Are you sure you want to change the base?
TIFF: Add ASCII validation for ASCII fields #487
Conversation
Fixes a bug in `readASCIIArray` which prevented the array from being populated. This went unnoticed because it was only used for one uncommon field (InkNames), but now the same logic is shared for all ASCII fields.
Codecov Report
@@ Coverage Diff @@
## integration #487 +/- ##
==============================================
Coverage 49.69% 49.69%
Complexity 987 987
==============================================
Files 55 55
Lines 7750 7750
Branches 1406 1406
==============================================
Hits 3851 3851
Misses 3435 3435
Partials 464 464 Continue to review full report at Codecov.
|
if (invalidAscii) { | ||
_info.setMessage(new ErrorMessage( | ||
MessageConstants.TIFF_HUL_72, value)); | ||
_info.setValid(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi I have mixed feelings with declaring invalid TIFF files just because of non ASCII (but ISO-8859-1) characters.
Indeed, I know, because this control doesn't exist, that we have thousand of images that have "Bibliothèque nationale de France" in comment but are otherwise perfectly fine.
It would be nice if we could make this configurable like the byteoffset=true
parameter to allow it to simply be a warning.
A parameter like invalidNonAsciiCharacters with the default value true will be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tledoux I've been working on a more generic way of ignoring / downgrading / upgrading errors by id using a configuration file passed at run-time. It might prove a useful addition as it makes reporting more flexible. Will do my best to get the PR out for the first part of next week for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yeah, we ourselves have quite a few (somewhat questionable) "©" symbols peppered about our TIFF collections. That was actually the motivator for the addition of this check. If JHOVE had had it earlier we would have picked up on this issue sooner and been able to remedy it with a better alternative.
The TIFF 5 and 6 specs are very clear that these fields should only contain "7-bit ASCII". Rendering software has no way to know what character encoding is being used in these fields if not ASCII, and so no characters outside of the ASCII range can be relied upon to render predictably outside of the software which wrote them. Given that, I think marking such TIFFs as invalid is actually fairly desirable, from a preservation perspective, though I certainly sympathize about now needing to deal with the problems our past mistakes have made for us.
The optional parameter isn't a bad idea, if you'd like to implement it, but it does make me think how much better a more general solution would be, allowing us to selectively suppress any messages using our cool newfangled message IDs! Alas... we currently set our validity statuses separately from our messages, so until message objects carry their own status modifiers, parameters may be the way to go.
Then again, a part of me also can't help but feel that by providing an explicit option for this check we may be implying that it's safe or reasonable to ignore this incompatibility under certain circumstances, which I don't believe is necessarily true. So I'm torn... but feel free to add what options you like :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlwilson Just now saw your reply. Of course you're already working on it :)
InkNames
field values